On the Optimal Reward Function of the Continuous Time Multiarmed Bandit Problem
Authors
Abstract
The optimal reward function associated with the so-called "multiarmed bandit problem" for general Markov-Feller processes is considered. It is shown that this optimal reward function has a simple (product-form) expression in terms of individual stopping problems, without requiring any smoothness of the optimal reward function, either for the global problem or for the individual stopping problems. Some results on a related problem with switching costs are also obtained.
Key words: variational inequality, switching problem, bandit problem, dynamic programming, index policy
AMS(MOS) subject classifications: 35B37, 49A60, 49B60, 60J25, 93E20
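The index-policy structure behind this product-form result can be sketched concretely. The snippet below is a minimal illustration, not the paper's construction: it assumes each arm's index function has already been computed from that arm's individual stopping problem, and shows only how the controller then reduces the multiarmed problem to comparing per-arm indices.

```python
def index_policy_step(arm_states, index_fns):
    """One step of an index policy: each arm carries an index computed
    from its own one-armed stopping problem, and the controller pulls
    the arm whose current index is largest.

    `arm_states` and `index_fns` are hypothetical per-arm states and
    index functions, supplied here purely for illustration.
    """
    indices = [f(s) for f, s in zip(index_fns, arm_states)]
    return max(range(len(indices)), key=lambda i: indices[i])
```

For example, with two hypothetical linear index functions, the step simply selects whichever arm's index evaluates higher at its current state.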
Similar articles
Continuous Time Associative Bandit Problems
In this paper we consider an extension of the multiarmed bandit problem. In this generalized setting, the decision maker receives some side information, performs an action chosen from a finite set and then receives a reward. Unlike in the standard bandit settings, performing an action takes a random period of time. The environment is assumed to be stationary, stochastic and memoryless. The goal...
Finite-Time Regret Bounds for the Multiarmed Bandit Problem
We show finite-time regret bounds for the multiarmed bandit problem under the assumption that all rewards come from a bounded and fixed range. Our regret bounds after any number T of pulls are of the form a + b log T + c log² T, where a, b, and c are positive constants not depending on T. These bounds are shown to hold for variants of the popular ε-greedy and Boltzmann allocation rules, and for a ...
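As a point of reference, the ε-greedy allocation rule mentioned above is easy to simulate. The sketch below assumes Bernoulli arms with hypothetical success probabilities; it illustrates the rule itself, not the constants a, b, c in the regret bound.

```python
import random

def epsilon_greedy(arm_means, epsilon=0.1, horizon=1000, seed=0):
    """Simulate the epsilon-greedy allocation rule on Bernoulli arms.

    `arm_means` are hypothetical true success probabilities. With
    probability epsilon the rule explores a uniformly random arm;
    otherwise it pulls the arm with the highest empirical mean.
    Returns (total reward, regret against always pulling the best arm).
    """
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k
    sums = [0.0] * k
    total_reward = 0.0
    for _ in range(horizon):
        if rng.random() < epsilon or 0 in counts:
            arm = rng.randrange(k)  # explore (or finish initial pulls)
        else:
            arm = max(range(k), key=lambda i: sums[i] / counts[i])  # exploit
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        total_reward += reward
    regret = horizon * max(arm_means) - total_reward
    return total_reward, regret
```

Averaging the returned regret over many seeds and horizons is how the logarithmic growth in T would be observed empirically.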
The Max K-Armed Bandit: A New Model of Exploration Applied to Search Heuristic Selection
The multiarmed bandit is often used as an analogy for the tradeoff between exploration and exploitation in search problems. The classic problem involves allocating trials to the arms of a multiarmed slot machine to maximize the expected sum of rewards. We pose a new variation of the multiarmed bandit—the Max K-Armed Bandit—in which trials must be allocated among the arms to maximize the expecte...
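The shift of objective, from the expected sum of rewards to the expected best single reward, can be sketched as follows. The Gaussian arm distributions, the ε parameter, and the pull-the-best-observed-maximum rule are all illustrative assumptions here, not the model or algorithm proposed in the paper.

```python
import random

def max_k_armed(arm_dists, epsilon=0.2, trials=500, seed=1):
    """Sketch of the Max K-Armed Bandit objective: allocate trials among
    arms to maximize the single best reward observed, not the sum.

    `arm_dists` is a hypothetical list of (mu, sigma) Gaussian arms.
    The rule pulls the arm with the highest observed maximum so far,
    exploring uniformly with probability epsilon.
    """
    rng = random.Random(seed)
    k = len(arm_dists)
    best = [float("-inf")] * k  # best sample seen on each arm
    for t in range(trials):
        if t < k or rng.random() < epsilon:
            arm = t % k if t < k else rng.randrange(k)  # explore
        else:
            arm = max(range(k), key=lambda i: best[i])  # exploit the max
        mu, sigma = arm_dists[arm]
        best[arm] = max(best[arm], rng.gauss(mu, sigma))
    return max(best)  # the quantity the Max K-Armed Bandit maximizes
```

Note that under this objective, heavy-tailed or high-variance arms can be preferable even when their means are lower, which is what distinguishes the model from the classic sum-of-rewards bandit.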
Optimal Policies for a Class of Restless Multiarmed Bandit Scheduling Problems with Applications to Sensor Management
Consider the Markov decision problems (MDPs) arising in the areas of intelligence, surveillance, and reconnaissance in which one selects among different targets for observation so as to track their position and classify them from noisy data [9], [10]; medicine in which one selects among different regimens to treat a patient [1]; and computer network security in which one selects different compu...
On the efficiency of Bayesian bandit algorithms from a frequentist point of view
In this contribution, we argue that algorithms derived from the Bayesian modelling of the multiarmed bandit problem are also optimal when evaluated using the frequentist cumulated regret as a measure of performance. We first show that the classical Gittins argument can be applied to convert the finite-horizon Bayesian multiarmed bandit problem into an MDP planning task that is numerically solva...
Journal:
Volume Issue
Pages -
Publication date: 2016